Conversation

@avjves (Contributor) commented Nov 10, 2025

What?

Adds support for KV cache and joint tensors to the USP method.

Why?

Currently, some models, such as HunyuanVideo, have diverging code paths depending on the input parameters. The two paths (Yunchang / USP) implement their communications differently: the Yunchang path uses features of torch.distributed that are not compatible with torch.compile, whereas the USP method can be fully compiled. This PR aims to make USP support the features the Yunchang path provides, allowing us to use only USP.

This PR is the first step towards deprecating Yunchang in the long term, as discussed in #579, but does not aim to fully remove it in the short term.

How?

Ported the Yunchang features directly to the USP method. This includes the joint tensors as well as the KV cache for pipeline parallelism. Also changed HunyuanVideo and Flux to use only the USP path rather than the Yunchang path.
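As a rough illustration of what joint-tensor support amounts to (this is a hedged sketch, not the PR's actual code; the function name and shapes are illustrative): the extra "joint" key/value tensors, e.g. for text tokens, are concatenated to the main key/value along the sequence dimension, so a single attention call covers both token streams.

```python
import torch
import torch.nn.functional as F

def attention_with_joint(query, key, value, joint_key=None, joint_value=None):
    # Tensors are (batch, heads, seq, head_dim); joint K/V, if given,
    # are appended after the main sequence so one call attends to both.
    if joint_key is not None:
        key = torch.cat([key, joint_key], dim=2)
        value = torch.cat([value, joint_value], dim=2)
    return F.scaled_dot_product_attention(query, key, value)

q = torch.randn(1, 4, 16, 8)
k, v = torch.randn(1, 4, 16, 8), torch.randn(1, 4, 16, 8)
jk, jv = torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8)
out = attention_with_joint(q, k, v, jk, jv)
print(out.shape)  # torch.Size([1, 4, 16, 8]) — output length follows the query
```

The output keeps the query's sequence length; only the keys/values grow, which is why this composes cleanly with sequence-parallel attention.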

Tests

Output:

HunyuanVideo:

Tested both Ring and Ulysses.

HunyuanVideo already uses USP by default if the input prompt is of a specific shape. The command and output below exercise the other code path, which previously used yunchang / hybrid_seq_parallel_attn.

Run command:

torchrun --nproc_per_node=8 examples/hunyuan_video_usp_example.py \
    --model tencent/HunyuanVideo \
    --prompt "In the large cage, two puppies were wagging their tails at each other." \
    --height 720 --width 1280 --num_frames 129 \
    --num_inference_steps 50 --ulysses_degree 8 \
    --enable_tiling --enable_slicing \
    --use_torch_compile
hunyuan_test_usp.mp4

Flux:

Flux already uses standard USP by default; only in the pipeline-parallelism case did it use Yunchang:

Run command:

torchrun --nproc_per_node=8 examples/flux_example.py \
    --model black-forest-labs/FLUX.1-dev \
    --seed 42 \
    --prompt "A small cat" \
    --height 1024 --width 1024 \
    --num_inference_steps 25 \
    --max_sequence_length 256 \
    --no_use_resolution_binning \
    --ulysses_degree 2 \
    --pipefusion_parallel_degree 4
flux_test_pipeline_parallel

Perf

HunyuanVideo

Switching from the Yunchang code path to the USP code path improves performance, since torch.compile can now cover the attention call as well. We timed three consecutive runs and report the average:

Before: 193.2s
After: 188.3s

This now matches the perf of the original USP path.

Automatic tests

Also added two new unit tests that compare the outputs of the Yunchang and USP implementations.
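A minimal sketch of the kind of equivalence check such a test performs (illustrative only — the PR's real tests compare the two distributed code paths; here a hand-rolled attention stands in for one side):

```python
import torch
import torch.nn.functional as F

def reference_attention(q, k, v):
    # Fused reference implementation.
    return F.scaled_dot_product_attention(q, k, v)

def manual_attention(q, k, v):
    # Explicit softmax(QK^T / sqrt(d)) V, matching SDPA's default scaling.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 32, 8) for _ in range(3))
# Raises if the two implementations diverge beyond tolerance.
torch.testing.assert_close(reference_attention(q, k, v),
                           manual_attention(q, k, v),
                           atol=1e-4, rtol=1e-4)
```

The distributed version of such a test additionally shards the sequence across ranks and gathers the outputs before comparing.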

Other

This doesn't change the standard USP method's behaviour, so other models using USP are unaffected. Some models still use Yunchang; these would need to be changed in future PRs.

For ease of comparison, here's the original USP method:

    # No sequence parallelism: plain attention.
    if get_sequence_parallel_world_size() == 1:
        out = _attention(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
    # Ring-only: Ulysses degree is 1, so no all-to-alls are needed.
    elif get_ulysses_parallel_world_size() == 1:
        out = ring_attn(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
    elif get_ulysses_parallel_world_size() > 1:
        # Ulysses: all-to-all re-shards Q/K/V from the sequence dim to the head dim.
        query = _ft_c_input_all_to_all(query)
        key = _ft_c_input_all_to_all(key)
        value = _ft_c_input_all_to_all(value)

        if get_ring_parallel_world_size() == 1:
            out = _attention(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
        else:
            out = ring_attn(query, key, value, dropout_p=dropout_p, is_causal=is_causal)

        # All-to-all back: re-shard the output from heads to sequence.
        out = _ft_c_output_all_to_all(out)

    return out
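For context, a hedged sketch of what per-layer KV caching for pipeline parallelism generally looks like (names and shapes are illustrative, not the PR's API): each pipeline patch computes K/V only for its slice of the sequence and appends it to a cache, so attention can run against the accumulated full-length K/V.

```python
import torch

class KVCache:
    """Illustrative per-layer cache; not the actual xFuser class."""

    def __init__(self):
        self.key = None
        self.value = None

    def update(self, key, value):
        # Append along the sequence dimension (dim=2 for batch, heads, seq, dim)
        # and return the full-length tensors for the attention call.
        if self.key is None:
            self.key, self.value = key, value
        else:
            self.key = torch.cat([self.key, key], dim=2)
            self.value = torch.cat([self.value, value], dim=2)
        return self.key, self.value

cache = KVCache()
k1 = torch.randn(1, 4, 8, 16)
full_k, _ = cache.update(k1, torch.randn_like(k1))   # first patch
k2 = torch.randn(1, 4, 8, 16)
full_k, _ = cache.update(k2, torch.randn_like(k2))   # second patch
print(full_k.shape)  # torch.Size([1, 4, 16, 16])
```

Folding this kind of cache update into the USP function above (rather than into the Yunchang wrapper) is what lets the whole attention call stay on the compilable path.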

@feifeibear (Collaborator) left a comment:
LGTM

@feifeibear (Collaborator) commented:

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request is a great step towards unifying attention mechanisms by adding joint tensor and KV-cache support to the USP method. The goal of deprecating the yunchang path to improve torch.compile compatibility is well-motivated. The changes in xfuser/model_executor/layers/usp.py are substantial and well-supported by new unit tests that verify equivalence with the old implementation. The modifications in attention_processor.py and transformer_flux.py to adopt the new USP interface are consistent and correct. Overall, the changes are well-executed. I have one suggestion to improve the structure of the new tests for better maintainability.

@jcaraban jcaraban self-assigned this Nov 11, 2025
@jcaraban jcaraban self-requested a review November 11, 2025 19:22
@jcaraban (Collaborator) left a comment:
LGTM

@feifeibear feifeibear merged commit b8ebdf7 into xdit-project:main Nov 12, 2025
@jcaraban jcaraban mentioned this pull request Nov 12, 2025